{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "666da745",
   "metadata": {},
   "source": [
    "# Medical RAG Research with txtai\n",
    "\n",
    "[txtai](https://github.com/neuml/txtai) is an all-in-one AI framework for semantic search, LLM orchestration and language model workflows.\n",
    "\n",
    "Large Language Models (LLMs) have captured the public's attention with their impressive capabilities. The Generative AI era has reached a fever pitch with some predicting the coming rise of superintelligence.\n",
    "\n",
    "LLMs are far from perfect though and we're still a ways away from true AI. One big challenge is with hallucinations. Hallucinations is the term for when an LLM generates output that is factually incorrect. The alarming part of this is that on a cursory glance, it actually sounds like factual content. The default behavior of LLMs is to produce plausible answers even when no plausible answer exists. LLMs are not great at saying I don't know.\n",
    "\n",
    "Retrieval Augmented Generation (RAG) helps reduce the risk of hallucinations by limiting the context in which a LLM can generate answers. This is typically done with a search query that hydrates a prompt with a relevant context. RAG has been one of the most practical use cases of the Generative AI era.\n",
    "\n",
    "This notebook will demonstrate how to build a Medical RAG Research process with txtai."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4c7cc15",
   "metadata": {},
   "source": [
    "# Install dependencies\n",
    "\n",
    "Install `txtai` and all dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a246bb3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install git+https://github.com/neuml/txtai"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc5c27ed",
   "metadata": {},
   "source": [
    "# Medical Dataset\n",
    "\n",
    "For this example, we'll use a [PubMed subset of article metadata for H5N1](https://huggingface.co/datasets/NeuML/pubmed-h5n1). This dataset was created using [`paperetl`](https://github.com/neuml/paperetl), an open-source library for parsing medical and scientific papers.\n",
    "\n",
    "[PubMed](https://pubmed.ncbi.nlm.nih.gov/) has over 38 million article abstracts as of June 2025. `paperetl` supports loading the full dataset with all 38 million articles or just a smaller subset. The dataset link above has more details on how this can be changed for different codes and keywords. This link also has information on how the article abstracts can be loaded in addition to the metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "5dd4145d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from txtai import Embeddings\n",
    "\n",
    "ds = load_dataset(\"neuml/pubmed-h5n1\", split=\"train\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d3b1cd8",
   "metadata": {},
   "source": [
    "Next, we'll build a `txtai` embeddings index with the articles. We'll use a vector embeddings model that specializes in vectorizing medical papers: [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "04e69829",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = Embeddings(path=\"neuml/pubmedbert-base-embeddings\", content=True, columns={\"text\": \"title\"})\n",
    "embeddings.index(x for x in ds if x[\"title\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "a801391e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7865"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "907898a4",
   "metadata": {},
   "source": [
    "# RAG Pipeline\n",
    "\n",
    "There are a number of [prior examples](https://neuml.github.io/txtai/examples/#llm) on how to run RAG with `txtai`. The [RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/) takes two main parameters, an embeddings database and an LLM. The embeddings database is the one just created above. For this example, we'll use a [simple local LLM with 600M parameters](https://huggingface.co/Qwen/Qwen3-0.6B).\n",
    "\n",
    "Substitute your own embeddings database to change the knowledge base. `txtai` supports running local LLMs via [transformers](https://github.com/huggingface/transformers) or [llama.cpp](https://github.com/abetlen/llama-cpp-python). It also supports a wide variety of LLMs via [LiteLLM](https://github.com/BerriAI/litellm). For example, setting the 2nd RAG pipeline parameter below to `gpt-4o` along with the appropriate environment variables with access keys switches to a hosted LLM. See [this documentation page](https://neuml.github.io/txtai/pipeline/text/llm/) for more on this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "cd05fc41",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cuda:0\n"
     ]
    }
   ],
   "source": [
    "from txtai import RAG\n",
    "\n",
    "# Prompt templates\n",
    "system = \"You are a friendly medical assistant that answers questions\"\n",
    "template = \"\"\"\n",
    "Answer the following question using the provided context.\n",
    "\n",
    "Question:\n",
    "{question}\n",
    "\n",
    "Context:\n",
    "{context}\n",
    "\"\"\"\n",
    "\n",
    "# Create RAG pipeline\n",
    "rag = RAG(embeddings, \"Qwen/Qwen3-0.6B\", system=system, template=template, output=\"flatten\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98800220",
   "metadata": {},
   "source": [
    "# RAG Queries\n",
    "\n",
    "Now that the pipeline is setup, let's run a query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "4d93ac4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<think>\n",
      "Okay, let's see. The user is asking about H5N1. The context provided starts with \"Why tell me now?\" and then goes into facts about H5N1. The first sentence mentions that people and healthcare providers are weighing in on pandemic messages. Then it says H5N1 is avian influenza, a potential pandemic.\n",
      "\n",
      "Wait, but the user's question is about H5N1. The context doesn't go into specifics about what H5N1 is, but it does state that it's avian influenza. So I need to make sure I answer based on that. The answer should be concise, maybe mention that H5N1 is avian flu and it's a potential pandemic. Also, note that people are weighing in on messages. But I need to check if there's any more information. The context ends there. So the answer should be straightforward.\n",
      "</think>\n",
      "\n",
      "H5N1 influenza viruses are a type of avian influenza, a potential pandemic influenza virus that could cause widespread illness and death. While the context highlights the importance of public health and preparedness, it does not provide more specific details about its characteristics or risks.\n"
     ]
    }
   ],
   "source": [
    "print(rag(\"Tell me about H5N1\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39ab9946",
   "metadata": {},
   "source": [
    "Notice that this LLM outputs a thinking or reasoning section then the answer.\n",
    "\n",
    "Let's review the context to validate this answer is derived from the knowledge base."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "53de0f96",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'id': '16775537',\n",
       "  'text': '\"Why tell me now?\" the public and healthcare providers weigh in on pandemic influenza messages.',\n",
       "  'score': 0.7156285643577576},\n",
       " {'id': '22308474',\n",
       "  'text': 'H5N1 influenza viruses: facts, not fear.',\n",
       "  'score': 0.658343493938446},\n",
       " {'id': '16440117',\n",
       "  'text': 'Avian influenza--a pandemic waiting to happen?',\n",
       "  'score': 0.5827972888946533},\n",
       " {'id': '20667302',\n",
       "  'text': 'The influenza A(H5N1) epidemic at six and a half years: 500 notified human cases and more to come.',\n",
       "  'score': 0.5593500137329102},\n",
       " {'id': '18936262',\n",
       "  'text': 'What Australians know and believe about bird flu: results of a population telephone survey.',\n",
       "  'score': 0.5568690299987793},\n",
       " {'id': '30349811',\n",
       "  'text': 'Back to the Future: Lessons Learned From the 1918 Influenza Pandemic.',\n",
       "  'score': 0.5540266036987305},\n",
       " {'id': '17276785',\n",
       "  'text': 'Pandemic influenza: what infection control professionals should know.',\n",
       "  'score': 0.5519200563430786},\n",
       " {'id': '16681227',\n",
       "  'text': 'A pandemic flu: not if, but when. SARS was the wake-up call we slept through.',\n",
       "  'score': 0.5518345832824707},\n",
       " {'id': '22402712',\n",
       "  'text': 'Ferretting out the facts behind the H5N1 controversy.',\n",
       "  'score': 0.5508109331130981},\n",
       " {'id': '25546511',\n",
       "  'text': \"One-way trip: influenza virus' adaptation to gallinaceous poultry may limit its pandemic potential.\",\n",
       "  'score': 0.5494509339332581}]"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings.search(\"Tell me about H5N1\", limit=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "025e6092",
   "metadata": {},
   "source": [
    "The answer is doing a good job being based on the context above. Also keep in mind this is a small 600M parameter model, which is even more impressive.\n",
    "\n",
    "Let's try another query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "1899047a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<think>\n",
      "Okay, let's see. The user is asking about the locations that have had H5N1 outbreaks, and the provided context mentions a few places: Indonesia and Bangladesh. The context also has a title about a decade of avian influenza in Bangladesh and mentions \"H5N1.\" \n",
      "\n",
      "Wait, the user's question is in English, so I need to make sure I'm interpreting the context correctly. The context includes two sentences: one about a decade in Bangladesh and another about H5N1. The user is probably looking for specific locations where H5N1 has been reported. \n",
      "\n",
      "Looking at the context again, it says \"Human avian influenza in Indonesia\" and \"A Decade of Avian Influenza in Bangladesh: Where Are We Now? Are we ready for pandemic influenza H5N1?\" So the outbreaks are in Indonesia and Bangladesh. \n",
      "\n",
      "I should confirm that there are no other mentions of other locations. The context doesn't provide more information beyond those two countries. Therefore, the answer should list Indonesia and Bangladesh as the locations with H5N1 outbreaks.\n",
      "</think>\n",
      "\n",
      "The locations with H5N1 outbreaks are Indonesia and Bangladesh.\n"
     ]
    }
   ],
   "source": [
    "print(rag(\"What locations have had H5N1 outbreaks?\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "39d4efa6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'id': '21706937',\n",
       "  'text': 'Human avian influenza in Indonesia: are they really clustered?',\n",
       "  'score': 0.6269429326057434},\n",
       " {'id': '31514405',\n",
       "  'text': 'A Decade of Avian Influenza in Bangladesh: Where Are We Now?',\n",
       "  'score': 0.5972536206245422},\n",
       " {'id': '15889987',\n",
       "  'text': 'Are we ready for pandemic influenza H5N1?',\n",
       "  'score': 0.5863772630691528},\n",
       " {'id': '17717543',\n",
       "  'text': 'Commentary: From scarcity to abundance: pandemic vaccines and other agents for \"have not\" countries.',\n",
       "  'score': 0.5844159126281738},\n",
       " {'id': '22491771',\n",
       "  'text': 'Two years after pandemic influenza A/2009/H1N1: what have we learned?',\n",
       "  'score': 0.5812581777572632},\n",
       " {'id': '39666804',\n",
       "  'text': \"Why hasn't the bird flu pandemic started?\",\n",
       "  'score': 0.5738048553466797},\n",
       " {'id': '23402131',\n",
       "  'text': 'Where do avian influenza viruses meet in the Americas?',\n",
       "  'score': 0.5638074278831482},\n",
       " {'id': '20667302',\n",
       "  'text': 'The influenza A(H5N1) epidemic at six and a half years: 500 notified human cases and more to come.',\n",
       "  'score': 0.560465395450592},\n",
       " {'id': '17338983',\n",
       "  'text': 'Human avian influenza: how ready are we?',\n",
       "  'score': 0.555113673210144},\n",
       " {'id': '24518630',\n",
       "  'text': 'Recognizing true H5N1 infections in humans during confirmed outbreaks.',\n",
       "  'score': 0.5501888990402222}]"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings.search(\"What locations have had H5N1 outbreaks?\", limit=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efa8e59e",
   "metadata": {},
   "source": [
    "Once again the answer is based on the context which mentions the two countries in the answer. The context also discusses the Americas but it doesn't have as strong of language connecting H5N1 outbreaks to the location."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18e04661",
   "metadata": {},
   "source": [
    "# Add citations\n",
    "\n",
    "The last item we'll cover is citations. One of the most important aspects of a RAG process is being able to ensure the answer is based on reality. There are a number of ways to do this but in this example, we'll ask the LLM to perform this step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "ce10ff39",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cuda:0\n"
     ]
    }
   ],
   "source": [
    "# Prompt templates\n",
    "system = \"You are a friendly medical assistant that answers questions\"\n",
    "template = \"\"\"\n",
    "Answer the following question using the provided context.\n",
    "\n",
    "After the answer, write a citation section with ALL the original article ids used for the answer.\n",
    "\n",
    "Question:\n",
    "{question}\n",
    "\n",
    "Context:\n",
    "{context}\n",
    "\"\"\"\n",
    "\n",
    "def context(question):\n",
    "    context = []\n",
    "    for x in embeddings.search(question, limit=10):\n",
    "        context.append(f\"ARTICLE ID: {x['id']}, TEXT: {x['text']}\")\n",
    "\n",
    "    return context\n",
    "\n",
    "# Create RAG pipeline\n",
    "rag = RAG(embeddings, \"Qwen/Qwen3-0.6B\", system=system, template=template, output=\"flatten\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f79bdc9a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "H5N1 is a type of avian influenza virus.  \n",
      "\n",
      "**Citation Section:**  \n",
      "- ARTICLE ID: 22010536, TEXT: Is avian influenza virus A(H5N1) a real threat to human health?\n"
     ]
    }
   ],
   "source": [
    "question = \"What is H5N1?\"\n",
    "print(rag(question, context(question), maxlength=2048, stripthink=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15539c5d",
   "metadata": {},
   "source": [
    "As expected, the answer adds a citation section. Also note that the RAG pipeline stripped the thinking section from the result."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bdf84b5",
   "metadata": {},
   "source": [
    "# Wrapping up\n",
    "\n",
    "This notebook covered how to build a Medical RAG Research process with `txtai`. It also covered how to modify this logic to add in your own knowledge base or use a more sophisticated LLM.\n",
    "\n",
    "With an important space such as the medical domain, it's vital to ensure that answers are derived from reliable knowledge. This notebook shows how to add that reliability via RAG. But as with anything in an important domain, there should be a human in the loop and answers shouldn't be blindly relied upon. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "local",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
